A bank is concerned that more and more customers are leaving its credit card services. The bank would like an analysis that identifies the main reasons customers leave and yields recommendations for mitigating churn, so that it can act on them proactively and keep its customers happy.
A full ERD can be found here
In this task, a few datasets are provided:
BankChurners.csv - this file contains basic information about each client (10 columns):
- CLIENTNUM - Client number. Unique identifier for the customer holding the account
- Attrition_Flag - Internal event (customer activity) variable - whether the client has churned (attrited) or not (existing)
- Dependent_count - Demographic variable - Number of dependents
- Card_Category - Product variable - Type of card (Blue, Silver, Gold, Platinum)
- Months_on_book - Period of relationship with the bank
- Months_Inactive_12_mon - Number of months inactive in the last 12 months
- Contacts_Count_12_mon - Number of contacts in the last 12 months
- Credit_Limit - Credit limit on the credit card
- Avg_Open_To_Buy - Open-to-buy credit line (average of the last 12 months)
- Avg_Utilization_Ratio - Average card utilization ratio

basic_client_info.csv - this file contains some basic client info for each client (6 columns):
- CLIENTNUM - Client number. Unique identifier for the customer holding the account
- Customer Age - Demographic variable - Customer's Age in Years
- Gender - Demographic variable - M=Male, F=Female
- Education_Level - Demographic variable - Educational qualification of the account holder (e.g. high school, college graduate, etc.)
- Marital_Status - Demographic variable - Married, Single, Divorced, Unknown
- Income_Category - Demographic variable - Annual income category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K, Unknown)
enriched_churn_data.csv - this file contains some enriched data about each client (7 columns):
- CLIENTNUM - Client number. Unique identifier for the customer holding the account
- Total_Relationship_Count - Total number of products held by the customer
- Total_Revolving_Bal - Total revolving balance on the credit card
- Total_Amt_Chng_Q4_Q1 - Change in transaction amount (Q4 over Q1)
- Total_Trans_Amt - Total transaction amount (last 12 months)
- Total_Trans_Ct - Total transaction count (last 12 months)
- Total_Ct_Chng_Q4_Q1 - Change in transaction count (Q4 over Q1)

# import necessary libraries
import csv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import plotly.express as px
import plotly.io as pio
import plotly.graph_objs as go
pio.renderers.default = 'notebook'
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score,
                             recall_score, f1_score)
# import from csv
data0 = pd.read_csv('bankchurners.csv')
data1 = pd.read_csv('enriched_churn_data.csv')
data2 = pd.read_csv('basic_client_info.csv')
# merge all tables
data01 = pd.merge(data0, data1, on='clientnum')
data = pd.merge(data01, data2, on='clientnum')
df = data  # the merge already returns a DataFrame
pd.set_option('display.max_colwidth', 10)
pd.set_option('display.max_columns', 21)
df.sample(10)
| | clientnum | attrition_flag | dependent_count | card_category | months_on_book | months_inactive_12_mon | contacts_count_12_mon | credit_limit | avg_open_to_buy | avg_utilization_ratio | total_relationship_count | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_ct_chng_q4_q1 | total_trans_ct | customer_age | gender | education_level | marital_status | income_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5496 | 717395733 | Existi... | 3 | Blue | 36 | 4 | 3 | 4142.0 | 3454.0 | 0.166 | 4 | 688 | 0.717 | 4512 | 0.976 | 81 | 42 | F | Doctorate | Married | $40K -... |
| 3753 | 779348358 | Attrit... | 1 | Blue | 37 | 3 | 4 | 1717.0 | 1717.0 | 0.000 | 5 | 0 | 1.028 | 3004 | 0.586 | 46 | 52 | F | Uneduc... | Single | Less t... |
| 9450 | 771190083 | Existi... | 3 | Blue | 23 | 2 | 1 | 2410.0 | 705.0 | 0.707 | 6 | 1705 | 0.798 | 2505 | 0.568 | 69 | 36 | F | Post-G... | Single | Less t... |
| 7374 | 789430083 | Existi... | 5 | Blue | 32 | 1 | 2 | 14987.0 | 13562.0 | 0.095 | 6 | 1425 | 0.800 | 4470 | 0.844 | 83 | 41 | F | High S... | Single | $40K -... |
| 815 | 716123283 | Existi... | 1 | Blue | 48 | 1 | 2 | 7288.0 | 7288.0 | 0.000 | 3 | 0 | 0.640 | 14873 | 0.714 | 120 | 59 | M | Unknown | Single | $60K -... |
| 2623 | 716769783 | Existi... | 2 | Blue | 40 | 2 | 4 | 5104.0 | 4368.0 | 0.144 | 3 | 736 | 0.679 | 3943 | 0.696 | 78 | 48 | F | Graduate | Married | Unknown |
| 9763 | 709858533 | Existi... | 2 | Blue | 20 | 1 | 2 | 2462.0 | 441.0 | 0.821 | 4 | 2021 | 0.966 | 2320 | 0.676 | 57 | 29 | M | Graduate | Divorced | Less t... |
| 2597 | 780717483 | Existi... | 3 | Blue | 40 | 3 | 3 | 7540.0 | 6767.0 | 0.103 | 6 | 773 | 0.688 | 4755 | 0.706 | 58 | 51 | F | High S... | Married | Less t... |
| 2170 | 807971658 | Existi... | 3 | Blue | 42 | 2 | 3 | 3669.0 | 1152.0 | 0.686 | 1 | 2517 | 0.830 | 13931 | 0.836 | 112 | 47 | F | Unknown | Married | Less t... |
| 4148 | 710406633 | Attrit... | 3 | Blue | 36 | 3 | 2 | 3664.0 | 3664.0 | 0.000 | 3 | 0 | 0.839 | 1168 | 0.409 | 31 | 51 | F | Uneduc... | Married | Unknown |
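The three-way merge above assumes `clientnum` is unique in every file; if it is not, `pd.merge` will silently duplicate rows. A minimal sketch, on toy frames rather than the real files, of how merge's `validate` parameter can guard against that:

```python
import pandas as pd

# Toy stand-ins for two of the source tables, keyed by clientnum
base = pd.DataFrame({'clientnum': [1, 2, 3], 'months_on_book': [36, 37, 23]})
enriched = pd.DataFrame({'clientnum': [1, 2, 3], 'total_trans_ct': [81, 46, 69]})

# validate='one_to_one' makes pandas raise MergeError if clientnum is
# duplicated on either side, so a silent row blow-up cannot go unnoticed
merged = pd.merge(base, enriched, on='clientnum', validate='one_to_one')
print(merged.shape)  # (3, 3)
```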
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   clientnum                 10127 non-null  int64
 1   attrition_flag            10127 non-null  object
 2   dependent_count           10127 non-null  int64
 3   card_category             10127 non-null  object
 4   months_on_book            10127 non-null  int64
 5   months_inactive_12_mon    10127 non-null  int64
 6   contacts_count_12_mon     10127 non-null  int64
 7   credit_limit              10127 non-null  float64
 8   avg_open_to_buy           10127 non-null  float64
 9   avg_utilization_ratio     10127 non-null  float64
 10  total_relationship_count  10127 non-null  int64
 11  total_revolving_bal       10127 non-null  int64
 12  total_amt_chng_q4_q1      10127 non-null  float64
 13  total_trans_amt           10127 non-null  int64
 14  total_ct_chng_q4_q1       10127 non-null  float64
 15  total_trans_ct            10127 non-null  int64
 16  customer_age              10127 non-null  int64
 17  gender                    10127 non-null  object
 18  education_level           10127 non-null  object
 19  marital_status            10127 non-null  object
 20  income_category           10127 non-null  object
dtypes: float64(5), int64(10), object(6)
memory usage: 1.7+ MB
#make a copy
df_clean = df.copy()
#check duplicates
df_clean.duplicated().sum()
0
#check nulls
df_clean.attrition_flag.isna().sum()
0
## Pre-processing
cat_cols = [c for c in df_clean.columns if df_clean[c].dtype == 'object']
cat_cols
['attrition_flag', 'card_category', 'gender', 'education_level', 'marital_status', 'income_category']
num_cols = [n for n in df_clean.columns if df_clean[n].dtype != 'object']
num_cols
['clientnum', 'dependent_count', 'months_on_book', 'months_inactive_12_mon', 'contacts_count_12_mon', 'credit_limit', 'avg_open_to_buy', 'avg_utilization_ratio', 'total_relationship_count', 'total_revolving_bal', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_ct_chng_q4_q1', 'total_trans_ct', 'customer_age']
df_clean[num_cols].describe()
| | clientnum | dependent_count | months_on_book | months_inactive_12_mon | contacts_count_12_mon | credit_limit | avg_open_to_buy | avg_utilization_ratio | total_relationship_count | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_ct_chng_q4_q1 | total_trans_ct | customer_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.0127... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... | 10127.... |
| mean | 7.3917... | 2.346203 | 35.928409 | 2.341167 | 2.455317 | 8631.9... | 7469.1... | 0.274894 | 3.812580 | 1162.8... | 0.759941 | 4404.0... | 0.712222 | 64.858695 | 46.325960 |
| std | 3.6903... | 1.298908 | 7.986416 | 1.010622 | 1.106225 | 9088.7... | 9090.6... | 0.275691 | 1.554408 | 814.98... | 0.219207 | 3397.1... | 0.238086 | 23.472570 | 8.016814 |
| min | 7.0808... | 0.000000 | 13.000000 | 0.000000 | 0.000000 | 1438.3... | 3.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 510.00... | 0.000000 | 10.000000 | 26.000000 |
| 25% | 7.1303... | 1.000000 | 31.000000 | 2.000000 | 2.000000 | 2555.0... | 1324.5... | 0.023000 | 3.000000 | 359.00... | 0.631000 | 2155.5... | 0.582000 | 45.000000 | 41.000000 |
| 50% | 7.1792... | 2.000000 | 36.000000 | 2.000000 | 2.000000 | 4549.0... | 3474.0... | 0.176000 | 4.000000 | 1276.0... | 0.736000 | 3899.0... | 0.702000 | 67.000000 | 46.000000 |
| 75% | 7.7314... | 3.000000 | 40.000000 | 3.000000 | 3.000000 | 11067.... | 9859.0... | 0.503000 | 5.000000 | 1784.0... | 0.859000 | 4741.0... | 0.818000 | 81.000000 | 52.000000 |
| max | 8.2834... | 5.000000 | 56.000000 | 6.000000 | 6.000000 | 34516.... | 34516.... | 0.999000 | 6.000000 | 2517.0... | 3.397000 | 18484.... | 3.714000 | 139.00... | 73.000000 |
df_clean.attrition_flag.value_counts()  # is this below the annual churn rate?
Existing Customer    8500
Attrited Customer    1627
Name: attrition_flag, dtype: int64
# Define a dictionary mapping the categories to binary values
dict_ch = {
'Existing Customer': 0,
'Attrited Customer': 1,
}
# Use the map() method to replace the values in the 'attrition_flag' column
df_clean['churned'] = df_clean['attrition_flag'].map(dict_ch)
# Drop the original 'attrition_flag' column
df_clean.drop('attrition_flag', axis=1, inplace=True)
df_clean
| | clientnum | dependent_count | card_category | months_on_book | months_inactive_12_mon | contacts_count_12_mon | credit_limit | avg_open_to_buy | avg_utilization_ratio | total_relationship_count | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_ct_chng_q4_q1 | total_trans_ct | customer_age | gender | education_level | marital_status | income_category | churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 806160108 | 1 | Blue | 56 | 2 | 3 | 3193.0 | 676.0 | 0.788 | 2 | 2517 | 1.831 | 1336 | 1.143 | 30 | 61 | M | High S... | Married | $40K -... | 0 |
| 1 | 804424383 | 1 | Blue | 56 | 3 | 2 | 10215.0 | 9205.0 | 0.099 | 3 | 1010 | 0.843 | 1904 | 1.000 | 40 | 63 | M | Unknown | Married | $60K -... | 0 |
| 2 | 708300483 | 0 | Blue | 56 | 4 | 3 | 7882.0 | 7277.0 | 0.077 | 5 | 605 | 1.052 | 704 | 0.143 | 16 | 66 | F | Doctorate | Married | Unknown | 1 |
| 3 | 808284783 | 1 | Blue | 56 | 0 | 0 | 1438.3 | 1438.3 | 0.000 | 6 | 0 | 0.813 | 1951 | 1.095 | 44 | 62 | F | Unknown | Married | Less t... | 0 |
| 4 | 712720158 | 1 | Blue | 56 | 2 | 3 | 13860.0 | 12208.0 | 0.119 | 5 | 1652 | 1.255 | 1910 | 1.909 | 32 | 68 | M | Graduate | Married | Unknown | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 714082308 | 3 | Blue | 13 | 1 | 3 | 8656.0 | 8656.0 | 0.000 | 2 | 0 | 0.893 | 8056 | 0.935 | 89 | 39 | M | High S... | Single | $80K -... | 1 |
| 10123 | 710751033 | 1 | Blue | 13 | 1 | 3 | 4829.0 | 4829.0 | 0.000 | 1 | 0 | 0.836 | 8286 | 0.857 | 65 | 31 | F | Graduate | Single | Less t... | 1 |
| 10124 | 780118158 | 0 | Blue | 13 | 2 | 4 | 3585.0 | 1165.0 | 0.675 | 2 | 2420 | 0.812 | 15656 | 0.746 | 124 | 27 | M | Graduate | Single | Less t... | 0 |
| 10125 | 779082633 | 2 | Blue | 13 | 2 | 3 | 4107.0 | 3128.0 | 0.238 | 3 | 979 | 0.647 | 14596 | 0.733 | 104 | 30 | M | Uneduc... | Single | $60K -... | 0 |
| 10126 | 788633733 | 3 | Silver | 13 | 3 | 2 | 28174.0 | 27001.0 | 0.042 | 4 | 1173 | 0.779 | 16258 | 0.714 | 108 | 36 | M | Graduate | Married | $60K -... | 0 |
10127 rows × 21 columns
# Calculate the percentage of churned customers
num_churned = df_clean[df_clean["churned"] == 1].shape[0]
num_total = df_clean.shape[0]
percent_churned = (num_churned / num_total) * 100
# Format the result as a percentage
print("Percentage of churned customers: {:.2f}%".format(percent_churned))
Percentage of churned customers: 16.07%
The annual churn rate for credit cards varies widely with the type of card and the target demographic, but around 20% per year is common. High-end premium cards often see lower churn, while cards aimed at subprime borrowers tend to see higher churn.
A lower churn rate is generally good news for an issuer: it suggests customers are satisfied with the card and unlikely to switch. A higher churn rate suggests customers are dissatisfied or are moving to cards with more attractive terms.
A preliminary overview of the dataset shows that around 83.93% of customers are active, while 16.07% have churned, which is below the typical annual rate.
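The same percentage can be obtained more compactly with `value_counts(normalize=True)`; a minimal sketch on a toy series standing in for `df_clean['churned']`:

```python
import pandas as pd

# Toy stand-in for df_clean['churned']: 16 churned out of 100 customers
churned = pd.Series([0] * 84 + [1] * 16)

# normalize=True returns proportions rather than raw counts
rates = churned.value_counts(normalize=True)
print(f"Percentage of churned customers: {rates[1]:.2%}")  # 16.00%
```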
# Map gender codes to descriptive labels
dict_g = {
    'F': 'Female',
    'M': 'Male',
}
df_clean['gender'] = df_clean['gender'].map(dict_g)
df.describe(include='object')
| | attrition_flag | card_category | gender | education_level | marital_status | income_category |
|---|---|---|---|---|---|---|
| count | 10127 | 10127 | 10127 | 10127 | 10127 | 10127 |
| unique | 2 | 4 | 2 | 7 | 4 | 6 |
| top | Existi... | Blue | F | Graduate | Married | Less t... |
| freq | 8500 | 9436 | 5358 | 3128 | 4687 | 3561 |
df_clean.describe().loc[['min', 'max', 'mean']].round(2)
| | clientnum | dependent_count | months_on_book | months_inactive_12_mon | contacts_count_12_mon | credit_limit | avg_open_to_buy | avg_utilization_ratio | total_relationship_count | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_ct_chng_q4_q1 | total_trans_ct | customer_age | churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | 7.0808... | 0.00 | 13.00 | 0.00 | 0.00 | 1438.30 | 3.00 | 0.00 | 1.00 | 0.00 | 0.00 | 510.00 | 0.00 | 10.00 | 26.00 | 0.00 |
| max | 8.2834... | 5.00 | 56.00 | 6.00 | 6.00 | 34516.00 | 34516.00 | 1.00 | 6.00 | 2517.00 | 3.40 | 18484.00 | 3.71 | 139.00 | 73.00 | 1.00 |
| mean | 7.3917... | 2.35 | 35.93 | 2.34 | 2.46 | 8631.95 | 7469.14 | 0.27 | 3.81 | 1162.81 | 0.76 | 4404.09 | 0.71 | 64.86 | 46.33 | 0.16 |
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   clientnum                 10127 non-null  int64
 1   dependent_count           10127 non-null  int64
 2   card_category             10127 non-null  object
 3   months_on_book            10127 non-null  int64
 4   months_inactive_12_mon    10127 non-null  int64
 5   contacts_count_12_mon     10127 non-null  int64
 6   credit_limit              10127 non-null  float64
 7   avg_open_to_buy           10127 non-null  float64
 8   avg_utilization_ratio     10127 non-null  float64
 9   total_relationship_count  10127 non-null  int64
 10  total_revolving_bal       10127 non-null  int64
 11  total_amt_chng_q4_q1      10127 non-null  float64
 12  total_trans_amt           10127 non-null  int64
 13  total_ct_chng_q4_q1       10127 non-null  float64
 14  total_trans_ct            10127 non-null  int64
 15  customer_age              10127 non-null  int64
 16  gender                    10127 non-null  object
 17  education_level           10127 non-null  object
 18  marital_status            10127 non-null  object
 19  income_category           10127 non-null  object
 20  churned                   10127 non-null  int64
dtypes: float64(5), int64(11), object(5)
memory usage: 1.7+ MB
corr_matrix = df_clean.corr(numeric_only=True)
fig = px.imshow(corr_matrix,
color_continuous_scale='YlOrRd',
aspect="auto",
title='Correlation Matrix')
fig.update_traces(
text=corr_matrix.round(2),
texttemplate="%{text:.2f}",
textfont_size=12,
hovertemplate=
"Feature 1: %{y}<br>Feature 2: %{x}<br>Correlation: %{text:.2f}<extra></extra>",
customdata=np.moveaxis(corr_matrix.values, 0, -1))
fig.update_xaxes(tickangle=90, tickfont=dict(size=12), title="")
fig.update_yaxes(tickfont=dict(size=12), title="")
fig.update_coloraxes(showscale=True, colorbar=dict(len=0.4, y=0.75))
fig.show()
# filter for churned
df_churned = df_clean[df_clean['churned'] == 1]
corr_matrix = df_churned.corr(numeric_only=True)
fig = px.imshow(corr_matrix,
color_continuous_scale='YlOrRd',
aspect="auto",
title='Correlation Matrix for Churned')
fig.update_traces(
text=corr_matrix.round(2),
texttemplate="%{text:.2f}",
textfont_size=12,
hovertemplate=
"Feature 1: %{y}<br>Feature 2: %{x}<br>Correlation: %{text:.2f}<extra></extra>",
customdata=np.moveaxis(corr_matrix.values, 0, -1))
fig.update_xaxes(tickangle=90, tickfont=dict(size=12), title="")
fig.update_yaxes(tickfont=dict(size=12), title="")
fig.update_coloraxes(showscale=True, colorbar=dict(len=0.4, y=0.75))
fig.show()
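Heatmaps are good for spotting pairs, but ranking every feature by its correlation with `churned` directly is often quicker. A minimal sketch on toy data (column names borrowed from this dataset, values synthetic):

```python
import numpy as np
import pandas as pd

# Toy numeric frame standing in for df_clean
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'total_trans_ct': rng.normal(size=500),
    'months_on_book': rng.normal(size=500),
})
# In this toy setup, churn is driven entirely by transaction count
toy['churned'] = (toy['total_trans_ct'] < 0).astype(int)

# Rank features by absolute correlation with the target
corr_with_target = (toy.corr(numeric_only=True)['churned']
                    .drop('churned')
                    .abs()
                    .sort_values(ascending=False))
print(corr_with_target.index[0])  # total_trans_ct
```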
X = df_clean.drop('churned', axis=1)
y = df_clean['churned']
# One-hot encode categorical variables
X_encoded = pd.get_dummies(X,
columns=[
'card_category', 'gender', 'education_level',
'marital_status', 'income_category'
])
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded,
y,
test_size=0.2,
random_state=42)
# Train the XGBoost model
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)
# Evaluate the model on the testing set
y_pred = xgb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
print('Accuracy:', accuracy)
print('Confusion matrix:', confusion_mat)
# Visualize the model's feature importances
importances = pd.Series(xgb_model.feature_importances_,
index=X_encoded.columns)
importances_nonzero = importances[importances != 0].sort_values(ascending=True)
# Create a horizontal bar chart of feature importances
fig = go.Figure(
go.Bar(x=importances_nonzero.values,
y=importances_nonzero.index,
orientation='h'))
fig.update_layout(title='Feature Importances (XGBoost)',
xaxis_title='F score',
yaxis_title='Feature')
fig.show()
Accuracy: 0.9772951628825272
Confusion matrix: [[1701   19]
 [  27  279]]
# Retrain the XGBoost model with the most important features
important_features = importances_nonzero.tail(10).index.tolist()
X_important = X_encoded[important_features]
X_train_important, X_test_important, y_train, y_test = train_test_split(
X_important, y, test_size=0.2, random_state=42)
xgb_model_important = xgb.XGBClassifier()
xgb_model_important.fit(X_train_important, y_train)
# Evaluate the model on the testing set
y_pred = xgb_model_important.predict(X_test_important)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
print('Confusion matrix:', confusion_mat)
Accuracy: 0.9763079960513327
Precision: 0.9358108108108109
Recall: 0.9052287581699346
F1 Score: 0.920265780730897
Confusion matrix: [[1701   19]
 [  29  277]]
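As a sanity check, the precision, recall, and F1 printed above can be recomputed by hand from the confusion matrix (sklearn's convention: rows are actual classes, columns are predicted):

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted
cm = np.array([[1701, 19],
               [29, 277]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # 277 / 296
recall = tp / (tp + fn)           # 277 / 306
f1 = 2 * tp / (2 * tp + fp + fn)  # 554 / 602
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9358 0.9052 0.9203
```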
# Create an XGBoost classifier with default hyperparameters
clf = xgb.XGBClassifier()
# Perform 5-fold cross-validation on the data
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1')
# Print the average F1 score across all folds
print('Average F1 score:', scores.mean())
Average F1 score: 0.8996300666012841
# initialize the logistic regression model
lr_model = LogisticRegression()
# train the model on the training data
lr_model.fit(X_train, y_train)
# make predictions on the testing data
y_pred = lr_model.predict(X_test)
# evaluate the model's performance
print(classification_report(y_test, y_pred, zero_division=1))
precision recall f1-score support
0 0.85 1.00 0.92 1720
1 1.00 0.00 0.00 306
accuracy 0.85 2026
macro avg 0.92 0.50 0.46 2026
weighted avg 0.87 0.85 0.78 2026
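Logistic regression predicts almost no churners here (recall 0.00 for class 1), which is typical when features like `credit_limit` span wildly different scales and the classes are imbalanced. A hedged sketch, on synthetic data rather than this dataset, of the usual remedy: standardize the features and set `class_weight='balanced'`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data; one feature blown up to mimic credit_limit's scale
X, y = make_classification(n_samples=2000, weights=[0.84], random_state=42)
X[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Scale features and reweight classes so the minority class is not ignored
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight='balanced'),
).fit(X_tr, y_tr)

rec = recall_score(y_te, model.predict(X_te))
print('recall on the minority class:', rec)
```

Whether this closes the gap to XGBoost on the real data would need to be checked; the point is only that raw logistic regression is not a fair baseline without scaling.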
# Create the DMatrix
dmatrix = xgb.DMatrix(data=X_encoded, label=y)
# Create the parameter dictionary: params
params = {"objective": "binary:logistic", "max_depth": 2}
# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=dmatrix, num_boost_round=10)
# Plot the first tree
fig = plt.figure(figsize=(25, 20), dpi=300)
xgb.plot_tree(xg_reg, num_trees=0, ax=plt.gca(), fontsize=14)
plt.show()
# Plot the fifth tree
fig = plt.figure(figsize=(25, 20), dpi=300)
xgb.plot_tree(xg_reg, num_trees=4, ax=plt.gca(), fontsize=14)
plt.show()
# Plot distribution of total transaction count by attrition flag
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_clean,
x='total_trans_ct',
hue='churned',
fill=True,
alpha=0.5,
palette='cool')
plt.title('Distribution of Total Transaction Count by Attrition Flag')
plt.xlabel('Total Transaction Count')
plt.ylabel('Density')
plt.show()
churned_trans_by_gender_age = df_clean[df_clean['churned'] == 1].groupby(
['gender'])['total_trans_ct'].describe()
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_trans_by_gender_age
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gender | ||||||||
| Male | 697.0 | 46.110473 | 17.437841 | 10.0 | 35.0 | 44.0 | 57.0 | 91.0 |
| Female | 930.0 | 44.051613 | 11.909848 | 12.0 | 38.0 | 43.0 | 49.0 | 94.0 |
# Plot distribution of total revolving balance by attrition flag
plt.figure(figsize=(10, 6), dpi=250)
sns.kdeplot(data=df_clean,
x='total_revolving_bal',
hue='churned',
fill=True,
alpha=0.5,
palette='cool')
plt.title('Distribution of Total Revolving Balance by Attrition Flag')
plt.xlabel('Total Revolving Balance')
plt.ylabel('Density')
plt.show()
churned_bal_by_gender_age = df_clean[df_clean['churned'] == 1].groupby(
['gender'])['total_revolving_bal'].describe()
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_bal_by_gender_age
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gender | ||||||||
| Male | 697.0 | 680.315638 | 917.519035 | 0.0 | 0.0 | 0.0 | 1307.00 | 2517.0 |
| Female | 930.0 | 667.207527 | 924.726152 | 0.0 | 0.0 | 0.0 | 1298.25 | 2517.0 |
# Plot distribution of total relationship count by attrition flag
plt.figure(figsize=(10, 6), dpi=250)
sns.kdeplot(data=df_clean,
x='total_relationship_count',
hue='churned',
fill=True,
alpha=0.5,
palette='cool')
plt.title('Distribution of Total Relationship Count by Attrition Flag')
plt.xlabel('Total Relationship Count')
plt.ylabel('Density')
plt.show()
churned_count_by_relationship = df_clean[df_clean['churned'] == 1].groupby(
['total_relationship_count'])['clientnum'].count()
churned_count_by_relationship.describe()
count      6.000000
mean     271.166667
std       81.715156
min      196.000000
25%      225.500000
50%      230.000000
75%      317.750000
max      400.000000
Name: clientnum, dtype: float64
# Plot distribution of total transaction amount by attrition flag
plt.figure(figsize=(10, 6), dpi=250)
sns.kdeplot(data=df_clean,
x='total_trans_amt',
hue='churned',
fill=True,
alpha=0.5,
palette='cool')
plt.title('Distribution of Total Transaction Amount by Attrition Flag')
plt.xlabel('Total Transaction Amount')
plt.ylabel('Density')
plt.show()
churned_trans_by_gender_age = df_clean[df_clean['churned'] == 1].groupby(
['gender'])['total_trans_amt'].describe()
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_trans_by_gender_age
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gender | ||||||||
| Male | 697.0 | 3509.779053 | 2766.604342 | 530.0 | 1759.00 | 2270.0 | 4883.00 | 10583.0 |
| Female | 930.0 | 2784.183871 | 1834.421181 | 510.0 | 2013.25 | 2360.0 | 2645.75 | 10294.0 |
# Plot distribution of total count change by attrition flag
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_clean,
x='total_ct_chng_q4_q1',
hue='churned',
fill=True,
alpha=0.5,
palette='cool')
plt.title('Distribution of Total Count Change by Attrition Flag')
plt.xlabel('Total Count Change Q4-Q1')
plt.ylabel('Density')
plt.show()
churned_total_ct_chng_q4_q1_by_gender = df_clean[
df_clean['churned'] == 1].groupby(['gender'
])['total_ct_chng_q4_q1'].describe()
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_total_ct_chng_q4_q1_by_gender
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gender | ||||||||
| Male | 697.0 | 0.579077 | 0.252603 | 0.0 | 0.4 | 0.552 | 0.735 | 2.500 |
| Female | 930.0 | 0.535881 | 0.203639 | 0.0 | 0.4 | 0.517 | 0.654 | 1.684 |
# Set style
sns.set_style('whitegrid')
# Create distribution plot of contacts count by attrition flag
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_clean,
x='contacts_count_12_mon',
hue='churned',
fill=True,
alpha=0.5,
palette='cool')
plt.title(
'Distribution of Contacts Count in the Last 12 Months by Attrition Flag')
plt.xlabel('Contacts Count in the Last 12 Months')
plt.ylabel('Density')
plt.show()
# Set style
sns.set_style('whitegrid')
# Create histogram of contacts count by attrition flag
plt.figure(figsize=(10, 6))
sns.histplot(data=df_clean,
x='contacts_count_12_mon',
hue='churned',
multiple='stack',
palette='cool')
plt.title('Distribution of Contacts Count in the Last 12 Months by Attrition Flag')
plt.xlabel('Contacts Count in the Last 12 Months')
plt.ylabel('Count')
plt.show()
# Plot distribution of months on book by attrition flag
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_clean,
x='months_on_book',
hue='churned',
fill=True,
alpha=0.5,
palette='cool')
plt.title('Distribution of Months on Book by Attrition Flag')
plt.xlabel('Months on Book')
plt.ylabel('Density')
plt.show()